In this project, we will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences with the highest expected rate of return. The data has been provided by Bertelsmann Arvato Analytics and represents a real-life data science task.
At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, we will report our findings for the completed section and document the decisions we made in our approach to each subtask.
# import libraries
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# magic word for producing visualizations in notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = [16, 8]
plt.rcParams['figure.dpi'] = 200
There are four files associated with this project (not including this one):
- `Udacity_AZDIAS_Subset.csv`: Demographics data for the general population of Germany; 891,221 persons (rows) x 85 features (columns).
- `Udacity_CUSTOMERS_Subset.csv`: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 85 features (columns).
- `Data_Dictionary.md`: Detailed information about the features in the provided datasets.
- `AZDIAS_Feature_Summary.csv`: Summary of feature attributes for the demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as their household, building, and neighborhood. We will use this information to cluster the general population into groups with similar demographic properties, and then see how the people in the customers dataset fit into those clusters. The hope is that certain clusters are over-represented in the customers data compared to the general population; those over-represented clusters will be assumed to form the core user base. This information can then be used for further applications, such as targeting a marketing campaign.
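As a preview of the end goal, the over-representation check can be sketched by comparing each cluster's share of the customer data against its share of the general population. This is only an illustrative sketch with made-up label arrays; the real `general_labels` and `customer_labels` would come from the KMeans model fit later in the project.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster assignments standing in for the model output.
general_labels = np.array([0, 0, 1, 2, 2, 2, 1, 0])
customer_labels = np.array([2, 2, 2, 1, 2, 0])

general_share = pd.Series(general_labels).value_counts(normalize=True)
customer_share = pd.Series(customer_labels).value_counts(normalize=True)

# Ratio > 1 means the cluster is over-represented among customers.
ratio = (customer_share / general_share).sort_values(ascending=False)
print(ratio)
```

Here cluster 2 would come out with the largest ratio, marking it as a candidate core-customer segment.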
To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so we'll need an additional argument in our read_csv() call to read in the data properly.
# Load in the general demographics data.
azdias = pd.read_csv('dataset/Udacity_AZDIAS_Subset.csv', sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv('dataset/AZDIAS_Feature_Summary.csv', sep=';')
print(f'Shape of demographics dataset : {azdias.shape}\n')
Shape of demographics dataset : (891221, 85)
azdias.head(10)
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
| 5 | 3 | 1 | 2 | 2.0 | 3 | 1 | 5 | 2 | 2 | 5 | ... | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 | 5.0 | 5.0 | 2.0 | 3.0 | 3.0 |
| 6 | -1 | 2 | 2 | 5.0 | 1 | 5 | 1 | 5 | 4 | 3 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 4.0 | 6.0 | 3.0 |
| 7 | -1 | 1 | 1 | 3.0 | 3 | 3 | 4 | 1 | 3 | 2 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 2.0 | 5.0 | 2.0 |
| 8 | -1 | 3 | 1 | 3.0 | 4 | 4 | 2 | 4 | 2 | 2 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 3.0 | 2.0 | 4.0 | 3.0 |
| 9 | -1 | 3 | 2 | 4.0 | 2 | 4 | 2 | 3 | 5 | 4 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 1.0 |
10 rows × 85 columns
azdias.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891221 entries, 0 to 891220 Data columns (total 85 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AGER_TYP 891221 non-null int64 1 ALTERSKATEGORIE_GROB 891221 non-null int64 2 ANREDE_KZ 891221 non-null int64 3 CJT_GESAMTTYP 886367 non-null float64 4 FINANZ_MINIMALIST 891221 non-null int64 5 FINANZ_SPARER 891221 non-null int64 6 FINANZ_VORSORGER 891221 non-null int64 7 FINANZ_ANLEGER 891221 non-null int64 8 FINANZ_UNAUFFAELLIGER 891221 non-null int64 9 FINANZ_HAUSBAUER 891221 non-null int64 10 FINANZTYP 891221 non-null int64 11 GEBURTSJAHR 891221 non-null int64 12 GFK_URLAUBERTYP 886367 non-null float64 13 GREEN_AVANTGARDE 891221 non-null int64 14 HEALTH_TYP 891221 non-null int64 15 LP_LEBENSPHASE_FEIN 886367 non-null float64 16 LP_LEBENSPHASE_GROB 886367 non-null float64 17 LP_FAMILIE_FEIN 886367 non-null float64 18 LP_FAMILIE_GROB 886367 non-null float64 19 LP_STATUS_FEIN 886367 non-null float64 20 LP_STATUS_GROB 886367 non-null float64 21 NATIONALITAET_KZ 891221 non-null int64 22 PRAEGENDE_JUGENDJAHRE 891221 non-null int64 23 RETOURTYP_BK_S 886367 non-null float64 24 SEMIO_SOZ 891221 non-null int64 25 SEMIO_FAM 891221 non-null int64 26 SEMIO_REL 891221 non-null int64 27 SEMIO_MAT 891221 non-null int64 28 SEMIO_VERT 891221 non-null int64 29 SEMIO_LUST 891221 non-null int64 30 SEMIO_ERL 891221 non-null int64 31 SEMIO_KULT 891221 non-null int64 32 SEMIO_RAT 891221 non-null int64 33 SEMIO_KRIT 891221 non-null int64 34 SEMIO_DOM 891221 non-null int64 35 SEMIO_KAEM 891221 non-null int64 36 SEMIO_PFLICHT 891221 non-null int64 37 SEMIO_TRADV 891221 non-null int64 38 SHOPPER_TYP 891221 non-null int64 39 SOHO_KZ 817722 non-null float64 40 TITEL_KZ 817722 non-null float64 41 VERS_TYP 891221 non-null int64 42 ZABEOTYP 891221 non-null int64 43 ALTER_HH 817722 non-null float64 44 ANZ_PERSONEN 817722 non-null float64 45 ANZ_TITEL 817722 non-null float64 46 HH_EINKOMMEN_SCORE 872873 non-null float64 47 
KK_KUNDENTYP 306609 non-null float64 48 W_KEIT_KIND_HH 783619 non-null float64 49 WOHNDAUER_2008 817722 non-null float64 50 ANZ_HAUSHALTE_AKTIV 798073 non-null float64 51 ANZ_HH_TITEL 794213 non-null float64 52 GEBAEUDETYP 798073 non-null float64 53 KONSUMNAEHE 817252 non-null float64 54 MIN_GEBAEUDEJAHR 798073 non-null float64 55 OST_WEST_KZ 798073 non-null object 56 WOHNLAGE 798073 non-null float64 57 CAMEO_DEUG_2015 792242 non-null object 58 CAMEO_DEU_2015 792242 non-null object 59 CAMEO_INTL_2015 792242 non-null object 60 KBA05_ANTG1 757897 non-null float64 61 KBA05_ANTG2 757897 non-null float64 62 KBA05_ANTG3 757897 non-null float64 63 KBA05_ANTG4 757897 non-null float64 64 KBA05_BAUMAX 757897 non-null float64 65 KBA05_GBZ 757897 non-null float64 66 BALLRAUM 797481 non-null float64 67 EWDICHTE 797481 non-null float64 68 INNENSTADT 797481 non-null float64 69 GEBAEUDETYP_RASTER 798066 non-null float64 70 KKK 770025 non-null float64 71 MOBI_REGIO 757897 non-null float64 72 ONLINE_AFFINITAET 886367 non-null float64 73 REGIOTYP 770025 non-null float64 74 KBA13_ANZAHL_PKW 785421 non-null float64 75 PLZ8_ANTG1 774706 non-null float64 76 PLZ8_ANTG2 774706 non-null float64 77 PLZ8_ANTG3 774706 non-null float64 78 PLZ8_ANTG4 774706 non-null float64 79 PLZ8_BAUMAX 774706 non-null float64 80 PLZ8_HHZ 774706 non-null float64 81 PLZ8_GBZ 774706 non-null float64 82 ARBEIT 794005 non-null float64 83 ORTSGR_KLS9 794005 non-null float64 84 RELAT_AB 794005 non-null float64 dtypes: float64(49), int64(32), object(4) memory usage: 578.0+ MB
azdias.describe()
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891221.000000 | 891221.000000 | 891221.000000 | 886367.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | ... | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 794005.000000 | 794005.000000 | 794005.00000 |
| mean | -0.358435 | 2.777398 | 1.522098 | 3.632838 | 3.074528 | 2.821039 | 3.401106 | 3.033328 | 2.874167 | 3.075121 | ... | 2.253330 | 2.801858 | 1.595426 | 0.699166 | 1.943913 | 3.612821 | 3.381087 | 3.167854 | 5.293002 | 3.07222 |
| std | 1.198724 | 1.068775 | 0.499512 | 1.595021 | 1.321055 | 1.464749 | 1.322134 | 1.529603 | 1.486731 | 1.353248 | ... | 0.972008 | 0.920309 | 0.986736 | 0.727137 | 1.459654 | 0.973967 | 1.111598 | 1.002376 | 2.303739 | 1.36298 |
| min | -1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.00000 |
| 25% | -1.000000 | 2.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 3.000000 | 2.000000 | 2.000000 | 2.000000 | ... | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 1.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 2.00000 |
| 50% | -1.000000 | 3.000000 | 2.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | ... | 2.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 4.000000 | 3.000000 | 3.000000 | 5.000000 | 3.00000 |
| 75% | -1.000000 | 4.000000 | 2.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | ... | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 7.000000 | 4.00000 |
| max | 3.000000 | 9.000000 | 2.000000 | 6.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | ... | 4.000000 | 4.000000 | 3.000000 | 2.000000 | 5.000000 | 5.000000 | 5.000000 | 9.000000 | 9.000000 | 9.00000 |
8 rows × 81 columns
print(f'Shape of features attributes for the demographics data : {feat_info.shape}\n')
Shape of features attributes for the demographics data : (85, 4)
feat_info
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
| ... | ... | ... | ... | ... |
| 80 | PLZ8_HHZ | macrocell_plz8 | ordinal | [-1] |
| 81 | PLZ8_GBZ | macrocell_plz8 | ordinal | [-1] |
| 82 | ARBEIT | community | ordinal | [-1,9] |
| 83 | ORTSGR_KLS9 | community | ordinal | [-1,0] |
| 84 | RELAT_AB | community | ordinal | [-1,9] |
85 rows × 4 columns
feat_info.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 85 entries, 0 to 84 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 attribute 85 non-null object 1 information_level 85 non-null object 2 type 85 non-null object 3 missing_or_unknown 85 non-null object dtypes: object(4) memory usage: 2.8+ KB
feat_info.describe()
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| count | 85 | 85 | 85 | 85 |
| unique | 85 | 9 | 5 | 9 |
| top | AGER_TYP | person | ordinal | [-1] |
| freq | 1 | 43 | 49 | 26 |
The feature summary file contains a summary of properties for each demographics data column. We will use this file to help us make cleaning decisions during this stage of the project. First of all, we should assess the demographics data in terms of missing data.
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. We'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. We might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.
def convert_number_array(string_list):
# Split the string into a list of values, remove leading/trailing whitespace
number_list = [number.strip() for number in string_list[1:-1].split(',')]
# Check if the list contains a single empty string, return None in that case
if len(number_list) == 1 and number_list[0] == '':
return None
    # Convert entries to integers, leaving the non-numeric codes 'X' and 'XX' as strings
number_list = [int(number) if number not in {'X', 'XX'} else number for number in number_list]
return number_list
# Create a dictionary mapping attribute names to missing or unknown values
feat_info_dict = dict(zip(feat_info['attribute'], feat_info['missing_or_unknown']))
feat_info_dict = {key: convert_number_array(value) for key, value in feat_info_dict.items()}
# Copy the original DataFrame
cleaned_azdias = azdias.copy()
# Replace missing or unknown values with NaN
for attribute, missing_values in feat_info_dict.items():
    if missing_values is not None:
        # Assign back instead of chained inplace replace, which is deprecated in pandas 2.x
        cleaned_azdias[attribute] = cleaned_azdias[attribute].replace(missing_values, np.nan)
How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. We will use matplotlib's bar() function to visualize the distribution of missing value counts to find these columns.
# Perform an assessment of how much missing data there is in each column of the dataset.
print(f'Total number of NaNs in dataset before conversion : {azdias.isna().sum().sum()}\n\n')
azdias.isna().sum()
Total number of NaNs in dataset before conversion : 4896838
AGER_TYP 0
ALTERSKATEGORIE_GROB 0
ANREDE_KZ 0
CJT_GESAMTTYP 4854
FINANZ_MINIMALIST 0
...
PLZ8_HHZ 116515
PLZ8_GBZ 116515
ARBEIT 97216
ORTSGR_KLS9 97216
RELAT_AB 97216
Length: 85, dtype: int64
print(f'Total number of NaNs in dataset after conversion : {cleaned_azdias.isna().sum().sum()}')
cleaned_values = cleaned_azdias.isna().sum().sort_values(ascending=False)
# Keep only the features whose NaN count exceeds 70% of the mean NaN count per feature
cleaned_values = cleaned_values[cleaned_values > 0.7 * cleaned_values.mean()]
Total number of NaNs in dataset after conversion : 8373929
cleaned_azdias.head(20)
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 2.0 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | 1.0 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | NaN | 3.0 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 2.0 | 4.0 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | NaN | 3.0 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
| 5 | 3.0 | 1.0 | 2 | 2.0 | 3 | 1 | 5 | 2 | 2 | 5 | ... | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 | 5.0 | 5.0 | 2.0 | 3.0 | 3.0 |
| 6 | NaN | 2.0 | 2 | 5.0 | 1 | 5 | 1 | 5 | 4 | 3 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 4.0 | 6.0 | 3.0 |
| 7 | NaN | 1.0 | 1 | 3.0 | 3 | 3 | 4 | 1 | 3 | 2 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 2.0 | 5.0 | 2.0 |
| 8 | NaN | 3.0 | 1 | 3.0 | 4 | 4 | 2 | 4 | 2 | 2 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 3.0 | 2.0 | 4.0 | 3.0 |
| 9 | NaN | 3.0 | 2 | 4.0 | 2 | 4 | 2 | 3 | 5 | 4 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 1.0 |
| 10 | NaN | 3.0 | 2 | 1.0 | 2 | 2 | 5 | 3 | 1 | 5 | ... | 2.0 | 4.0 | 2.0 | 0.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
| 11 | NaN | 2.0 | 1 | 6.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | NaN | 3.0 | 1 | 6.0 | 5 | 3 | 4 | 2 | 4 | 1 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 3.0 | 6.0 | 4.0 |
| 13 | NaN | 1.0 | 2 | 5.0 | 1 | 4 | 3 | 5 | 5 | 2 | ... | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 6.0 | 4.0 |
| 14 | NaN | 3.0 | 1 | 6.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | 1.0 | 4.0 | 2 | 4.0 | 4 | 1 | 5 | 1 | 1 | 4 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 8.0 | 5.0 |
| 16 | NaN | 1.0 | 2 | 1.0 | 4 | 3 | 1 | 4 | 5 | 1 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 1.0 | 2.0 | 1.0 |
| 17 | NaN | 2.0 | 1 | 6.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | NaN | 2.0 | 2 | 6.0 | 2 | 4 | 1 | 5 | 4 | 1 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 |
| 19 | NaN | 3.0 | 1 | 3.0 | 5 | 2 | 3 | 1 | 3 | 1 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 5.0 | 4.0 | 4.0 | 6.0 | 3.0 |
20 rows × 85 columns
# Investigate patterns in the amount of missing data in each column.
fig, ax = plt.subplots()
ax.bar(cleaned_values.index, cleaned_values)
plt.xticks(rotation=90)
fig.tight_layout()
ax.set_xlabel('Feature', fontsize=15)
ax.set_ylabel('Count', fontsize=15)
ax.set_title('Number of missing values for each feature', fontsize=20)
plt.show()
# Remove the outlier columns from the dataset.
outliers = ['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH']
cleaned_azdias.drop(outliers, axis=1, inplace=True)
cleaned_feat_info = feat_info[~feat_info['attribute'].isin(outliers)].copy()
cleaned_feat_info.reset_index(drop=True, inplace=True)
print(f'Shape of dataset after dropping outliers : {cleaned_azdias.shape}\n')
print(f'Shape of feature information : {cleaned_feat_info.shape}')
Shape of dataset after dropping outliers : (891221, 79) Shape of feature information : (79, 4)
Before parsing the feature summary file, the dataset contained 4,896,838 missing values recognized by pandas. After converting the 'missing' and 'unknown' codes with the help of the feature summary file, we ended up with 8,373,929 missing values (NaN).
The outlier features dropped from the dataset are: `TITEL_KZ`, `AGER_TYP`, `KK_KUNDENTYP`, `KBA05_BAUMAX`, `GEBURTSJAHR`, and `ALTER_HH`.
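The outlier list above was chosen by inspecting the bar chart, but the same kind of selection can be done programmatically with a missing-share cutoff. The 20% cutoff and the toy DataFrame below are assumptions for illustration, not the project's actual criterion.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the demographics DataFrame.
df = pd.DataFrame({
    'A': [1, np.nan, np.nan, np.nan],  # 75% missing
    'B': [1, 2, 3, np.nan],            # 25% missing
    'C': [1, 2, 3, 4],                 # 0% missing
})

# isna().mean() gives the fraction of missing values per column.
missing_share = df.isna().mean()
outlier_cols = missing_share[missing_share > 0.20].index.tolist()
print(outlier_cols)  # columns exceeding the cutoff
```

On the real data, `df` would be `cleaned_azdias` and the cutoff would be tuned to match the visual gap in the bar chart.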
Now we'll perform a similar assessment for the rows of the dataset: how much data is missing in each row? As with the columns, we should see some groups of points with very different numbers of missing values. We will divide the data into two subsets: one for data points above some threshold of missing values, and a second for points below that threshold.
Depending on what we observe in this comparison, there will be implications for how we approach our conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should mark those data as special. Either way, we will continue the analysis for now using just the subset with few or no missing values.
# How much data is missing in each row of the dataset?
row_nan_count = cleaned_azdias.isna().sum(axis=1)
plt.style.use("seaborn-v0_8-whitegrid")
plt.hist(row_nan_count, facecolor = '#2ab0ff', edgecolor='#e0e0e0', linewidth=0.5)
plt.title('Rows\' Missing Data', fontsize=20)
plt.xlabel('NaN Count', fontsize=15)
plt.ylabel('Row Frequency', fontsize=15)
plt.tight_layout()
plt.show()
# Divide the data into two subsets based on the number of missing values in each row.
# Threshold chosen is 25
threshold = 25
# Take explicit copies so later modifications don't trigger SettingWithCopyWarning
azdias_above_thresh = cleaned_azdias[row_nan_count > threshold].copy()
azdias_below_thresh = cleaned_azdias[row_nan_count <= threshold].copy()
azdias_above_thresh.reset_index(drop=True, inplace=True)
azdias_below_thresh.reset_index(drop=True, inplace=True)
print(f'Shape of data subset with missing data above threshold = {azdias_above_thresh.shape}')
print(f'Shape of data subset with missing data below (and equal) threshold = {azdias_below_thresh.shape}')
Shape of data subset with missing data above threshold = (93260, 79) Shape of data subset with missing data below (and equal) threshold = (797961, 79)
cleaned_azdias.isna().sum().sort_values().head(30)
ZABEOTYP 0 SEMIO_REL 0 SEMIO_MAT 0 SEMIO_VERT 0 SEMIO_LUST 0 SEMIO_ERL 0 SEMIO_KULT 0 SEMIO_RAT 0 SEMIO_KRIT 0 SEMIO_DOM 0 SEMIO_KAEM 0 GREEN_AVANTGARDE 0 SEMIO_PFLICHT 0 FINANZTYP 0 FINANZ_HAUSBAUER 0 FINANZ_UNAUFFAELLIGER 0 FINANZ_ANLEGER 0 FINANZ_VORSORGER 0 FINANZ_SPARER 0 FINANZ_MINIMALIST 0 SEMIO_TRADV 0 ANREDE_KZ 0 SEMIO_SOZ 0 SEMIO_FAM 0 ALTERSKATEGORIE_GROB 2881 LP_STATUS_GROB 4854 LP_STATUS_FEIN 4854 ONLINE_AFFINITAET 4854 RETOURTYP_BK_S 4854 GFK_URLAUBERTYP 4854 dtype: int64
def compare_column_dist(data_1, data_2, data_1_annotation, data_2_annotation,
title, column):
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
plt.style.use("seaborn-v0_8-whitegrid")
y_limit = max(data_1[column].value_counts().iloc[0], data_2[column].value_counts().iloc[0]) * 1.1
#sns.countplot(data_1[column], ax=ax1)
ax1.hist(data_1[column])
ax1.set_title(data_1_annotation)
ax1.set_ylabel('Frequency')
ax1.set_ylim(top=y_limit)
#sns.countplot(data_2[column], ax=ax2)
ax2.hist(data_2[column])
ax2.set_title(data_2_annotation)
ax2.set_ylabel('Frequency')
ax2.set_ylim(top=y_limit)
fig.suptitle(title, fontsize=20)
plt.tight_layout()
plt.show()
# Compare the distribution of values for five columns with no
# missing values between the two subsets.
# No missing values
annotate_1 = "Rows Above Threshold (25) NaN Values"
annotate_2 = "Rows Below Threshold (25) NaN Values"
first_comparison_title = 'Compare Distribution with Columns with NO missing values'
for column in ['ZABEOTYP', 'SEMIO_DOM', 'FINANZ_ANLEGER', 'FINANZ_SPARER', 'GREEN_AVANTGARDE']:
compare_column_dist(azdias_above_thresh, azdias_below_thresh,
annotate_1, annotate_2, first_comparison_title, column)
# Few missing values comparison
second_comparison_title = 'Compare Distribution between Columns with FEW missing values'
for column in ['ALTERSKATEGORIE_GROB', 'ONLINE_AFFINITAET', 'GFK_URLAUBERTYP']:
compare_column_dist(azdias_above_thresh, azdias_below_thresh,
annotate_1, annotate_2, second_comparison_title, column)
We chose a threshold of 25 missing values per row to subset the data, and found that the rows with many missing values are qualitatively different from the rows with few or no missing values.
We will mark those rows as special rather than drop them, and continue the analysis using the subset below the threshold.
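The judgment above is visual, but it can be backed with a number. One option (an illustrative sketch, not part of the project rubric) is the total variation distance between the two empirical distributions of a column: 0 means identical distributions, 1 means disjoint support. The toy subsets below stand in for `azdias_above_thresh` and `azdias_below_thresh`.

```python
import pandas as pd

# Toy subsets standing in for the above/below-threshold data.
above = pd.DataFrame({'SEMIO_DOM': [7, 7, 7, 6, 7]})
below = pd.DataFrame({'SEMIO_DOM': [1, 2, 3, 4, 5, 6, 7, 4, 3]})

# Normalized value counts give each subset's empirical distribution.
p = above['SEMIO_DOM'].value_counts(normalize=True)
q = below['SEMIO_DOM'].value_counts(normalize=True)

# Total variation distance: half the sum of absolute differences.
tvd = p.subtract(q, fill_value=0).abs().sum() / 2
print(round(tvd, 3))
```

A large value (close to 1) for several columns would quantitatively support treating the high-missing rows as a distinct group.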
Before applying Principal Component Analysis (PCA), we perform a crucial preprocessing step by identifying and removing highly correlated features from the dataset. This process helps reduce multicollinearity, improve interpretability, enhance stability, and make the subsequent PCA analysis more computationally efficient. By eliminating redundant information, we ensure that the resulting principal components capture the most meaningful and unique aspects of the data, ultimately leading to more reliable and insightful dimensionality reduction.
# Calculate the correlation matrix for numeric columns
corr_matrix = azdias_below_thresh.corr(numeric_only=True).abs()
# Create an upper triangular mask
upper_triangle_mask = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Use the mask to find highly correlated columns
columns_to_drop = [column for column in upper_triangle_mask.columns if any(upper_triangle_mask[column] > 0.95)]
# Create the correlation heatmap
plt.figure(figsize=(12, 10))
heatmap = sns.heatmap(azdias_below_thresh[columns_to_drop].corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Highly Correlated Features\n(Threshold = 0.95)', fontdict={'fontsize': 18}, pad=12)
plt.show()
# Drop the highly correlated columns from the DataFrame
# (assign back instead of inplace drop on a slice, to avoid SettingWithCopyWarning)
azdias_below_thresh = azdias_below_thresh.drop(columns=columns_to_drop).reset_index(drop=True)
# Update our feature information
cleaned_feat_info = cleaned_feat_info[~cleaned_feat_info['attribute'].isin(columns_to_drop)]
cleaned_feat_info.reset_index(drop=True, inplace=True)
Checking for missing data isn't the only way in which we can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, we need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
In the first two parts of this sub-step, we will investigate the categorical and mixed-type features and decide for each whether to keep, drop, or re-encode it. Then, in the last part, we will create a new data frame with only the selected and engineered columns.
# Number of features of each data type?
cleaned_feat_info['type'].value_counts()
ordinal 49 categorical 15 numeric 6 mixed 5 Name: type, dtype: int64
# Checking ordinal and numeric values sanity
ord_num_feats = cleaned_feat_info.loc[cleaned_feat_info['type'].isin(['ordinal', 'numeric']), 'attribute']
ord_num_values = [azdias_below_thresh[column].dropna().unique() for column in ord_num_feats]
for value, feat in zip(ord_num_values, ord_num_feats):
print(f'Attribute: {feat} with unique values: {value}')
Attribute: ALTERSKATEGORIE_GROB with unique values: [1. 3. 4. 2.] Attribute: FINANZ_MINIMALIST with unique values: [1 4 3 2 5] Attribute: FINANZ_SPARER with unique values: [5 4 2 3 1] Attribute: FINANZ_VORSORGER with unique values: [2 1 5 4 3] Attribute: FINANZ_ANLEGER with unique values: [5 2 1 4 3] Attribute: FINANZ_UNAUFFAELLIGER with unique values: [4 3 1 2 5] Attribute: FINANZ_HAUSBAUER with unique values: [5 2 3 4 1] Attribute: HEALTH_TYP with unique values: [3. 2. 1.] Attribute: RETOURTYP_BK_S with unique values: [1. 3. 2. 5. 4.] Attribute: SEMIO_SOZ with unique values: [5 4 6 2 7 3 1] Attribute: SEMIO_FAM with unique values: [4 1 5 7 2 6 3] Attribute: SEMIO_REL with unique values: [4 3 2 7 5 1 6] Attribute: SEMIO_MAT with unique values: [3 1 2 4 7 5 6] Attribute: SEMIO_VERT with unique values: [1 4 7 2 6 5 3] Attribute: SEMIO_LUST with unique values: [2 4 6 7 3 1 5] Attribute: SEMIO_ERL with unique values: [2 6 7 4 5 1 3] Attribute: SEMIO_KULT with unique values: [3 4 6 5 7 1 2] Attribute: SEMIO_RAT with unique values: [6 4 3 2 7 5 1] Attribute: SEMIO_KRIT with unique values: [4 7 3 1 5 6 2] Attribute: SEMIO_DOM with unique values: [7 4 2 1 5 6 3] Attribute: SEMIO_KAEM with unique values: [4 7 5 2 3 6 1] Attribute: SEMIO_PFLICHT with unique values: [7 3 4 5 1 6 2] Attribute: SEMIO_TRADV with unique values: [6 3 4 2 7 5 1] Attribute: ANZ_PERSONEN with unique values: [ 2. 1. 0. 4. 3. 5. 6. 7. 8. 12. 9. 21. 10. 13. 11. 14. 45. 20. 31. 37. 16. 22. 15. 23. 18. 35. 17. 40. 38.] Attribute: ANZ_TITEL with unique values: [0. 1. 2. 4. 3. 6.] Attribute: HH_EINKOMMEN_SCORE with unique values: [6. 4. 1. 5. 3. 2.] Attribute: W_KEIT_KIND_HH with unique values: [3. 2. 6. 5. 4. 1.] Attribute: WOHNDAUER_2008 with unique values: [9. 8. 3. 4. 5. 6. 2. 7. 1.] Attribute: ANZ_HAUSHALTE_AKTIV with unique values: [ 11. 10. 1. 3. 5. 4. 6. 2. 9. 14. 13. 8. 7. 44. 25. 36. 16. 272. 33. 15. 12. 22. 21. 20. 67. 23. 18. 27. 24. 31. 39. 19. 30. 17. 29. 61. 54. 88. 49. 60. 38. 34. 48. 35. 
62. 26. 45. 42. 28. 77. 47. 50. 236. 137. 43. 52. 41. 76. 63. 32. 40. 37. 59. 53. 57. 118. 66. 95. 78. 70. 46. 185. 73. 154. 80. 102. 92. 126. 94. 71. 68. 81. 69. 121. 258. 58. 86. 124. 85. 90. 128. 51. 72. 64. 56. 83. 150. 259. 91. 160. 134. 169. 246. 82. 114. 183. 116. 74. 143. 99. 216. 98. 125. 112. 123. 55. 120. 75. 141. 103. 87. 89. 223. 93. 179. 135. 149. 97. 101. 110. 79. 111. 109. 65. 84. 367. 162. 195. 107. 119. 331. 122. 161. 157. 175. 177. 136. 104. 316. 127. 100. 180. 155. 133. 178. 176. 108. 353. 131. 130. 96. 256. 117. 146. 211. 199. 140. 200. 106. 285. 263. 314. 253. 163. 377. 171. 148. 191. 243. 153. 145. 174. 172. 190. 142. 317. 244. 189. 286. 105. 170. 215. 321. 113. 115. 156. 129. 242. 132. 280. 159. 138. 158. 267. 151. 164. 187. 206. 266. 265. 209. 196. 139. 193. 184. 210. 260. 366. 194. 318. 344. 277. 595. 197. 247. 165. 283. 326. 181. 186. 438. 222. 144. 378. 237. 152. 208. 274. 430. 202. 214. 231. 219. 225. 230. 198. 224. 333. 168. 221. 523. 254. 305. 166. 252. 445. 147. 301. 213. 536. 220. 249. 290. 515. 276. 241. 204. 250. 218. 226. 201. 240. 404. 348. 328. 228. 304. 167. 346. 205. 207. 293. 255. 307. 414. 311. 347. 379. 395. 229. 232.] Attribute: ANZ_HH_TITEL with unique values: [ 0. 1. 5. 2. 3. 7. 4. 6. 9. 15. 14. 8. 11. 10. 12. 13. 20. 16. 17. 23. 18.] Attribute: KONSUMNAEHE with unique values: [1. 5. 4. 3. 2. 6. 7.] Attribute: MIN_GEBAEUDEJAHR with unique values: [1992. 1997. 2005. 2009. 1994. 1996. 2002. 2015. 1991. 1993. 1995. 2003. 2008. 2006. 2000. 1990. 2004. 1999. 1998. 2001. 2007. 2013. 1989. 2014. 2012. 2010. 1987. 2011. 1988. 1985. 2016. 1986.] Attribute: KBA05_ANTG1 with unique values: [0. 1. 4. 2. 3.] Attribute: KBA05_ANTG2 with unique values: [0. 3. 1. 4. 2.] Attribute: KBA05_ANTG3 with unique values: [0. 1. 2. 3.] Attribute: KBA05_ANTG4 with unique values: [2. 0. 1.] Attribute: KBA05_GBZ with unique values: [1. 3. 4. 5. 2.] Attribute: BALLRAUM with unique values: [6. 2. 4. 3. 7. 1. 5.] 
Attribute: EWDICHTE with unique values: [3. 4. 2. 5. 6. 1.] Attribute: INNENSTADT with unique values: [8. 4. 6. 1. 7. 3. 2. 5.] Attribute: GEBAEUDETYP_RASTER with unique values: [3. 4. 5. 1. 2.] Attribute: KKK with unique values: [2. 3. 4. 1.] Attribute: MOBI_REGIO with unique values: [1. 3. 4. 5. 2. 6.] Attribute: ONLINE_AFFINITAET with unique values: [3. 2. 1. 5. 4. 0.] Attribute: REGIOTYP with unique values: [3. 2. 5. 1. 7. 6. 4.] Attribute: KBA13_ANZAHL_PKW with unique values: [963. 712. 596. ... 2. 30. 7.] Attribute: PLZ8_ANTG1 with unique values: [2. 3. 1. 4. 0.] Attribute: PLZ8_ANTG2 with unique values: [3. 2. 4. 1. 0.] Attribute: PLZ8_ANTG3 with unique values: [2. 1. 3. 0.] Attribute: PLZ8_ANTG4 with unique values: [1. 0. 2.] Attribute: PLZ8_HHZ with unique values: [5. 4. 3. 2. 1.] Attribute: PLZ8_GBZ with unique values: [4. 3. 5. 2. 1.] Attribute: ARBEIT with unique values: [3. 2. 4. 1. 5.] Attribute: ORTSGR_KLS9 with unique values: [5. 3. 6. 4. 8. 2. 7. 9. 1.] Attribute: RELAT_AB with unique values: [4. 2. 3. 5. 1.]
For categorical data, we would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, we perform one of the following: binary variables that are already numeric are kept as-is; binary variables encoded as strings (such as OST_WEST_KZ) are re-encoded to numbers; and multi-level categoricals are one-hot encoded into dummy variables.
# Assess categorical variables: which are binary, which are multi-level, and
# which ones need to be re-encoded?
categorical_feats = cleaned_feat_info[cleaned_feat_info['type'] == 'categorical']['attribute'].values.tolist()
encoded_columns = [feat for feat in categorical_feats if azdias_below_thresh[feat].nunique(dropna=True) > 2]
for feat in encoded_columns:
    unique_values = azdias_below_thresh[feat].dropna().unique()
    print(f'Feature: {feat} has unique values (excluding NaN): {unique_values}')
Feature: CJT_GESAMTTYP has unique values (excluding NaN): [5. 3. 2. 4. 1. 6.]
Feature: FINANZTYP has unique values (excluding NaN): [1 6 5 2 4 3]
Feature: GFK_URLAUBERTYP has unique values (excluding NaN): [10. 1. 5. 12. 9. 3. 8. 11. 4. 2. 7. 6.]
Feature: LP_STATUS_FEIN has unique values (excluding NaN): [ 2. 3. 9. 4. 1. 10. 5. 8. 6. 7.]
Feature: NATIONALITAET_KZ has unique values (excluding NaN): [1. 3. 2.]
Feature: SHOPPER_TYP has unique values (excluding NaN): [3. 2. 1. 0.]
Feature: ZABEOTYP has unique values (excluding NaN): [5 3 4 1 6 2]
Feature: GEBAEUDETYP has unique values (excluding NaN): [8. 1. 3. 2. 6. 4. 5.]
Feature: CAMEO_DEUG_2015 has unique values (excluding NaN): ['8' '4' '2' '6' '1' '9' '5' '7' '3']
Feature: CAMEO_DEU_2015 has unique values (excluding NaN): ['8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D' '9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C' '9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F' '5F' '1C']
# Re-encode categorical variable(s) to be kept in the analysis.
# Work on an explicit copy to avoid SettingWithCopyWarning on a slice
azdias_below_thresh = azdias_below_thresh.copy()
# First, convert CAMEO_DEUG_2015 from number strings to numeric values
azdias_below_thresh['CAMEO_DEUG_2015'] = pd.to_numeric(azdias_below_thresh['CAMEO_DEUG_2015'], errors='coerce')
# Manually encode the binary feature OST_WEST_KZ (NaN stays NaN)
azdias_below_thresh['OST_WEST_KZ'] = azdias_below_thresh['OST_WEST_KZ'].map({'W': 1, 'O': 0})
# One-hot encode the multi-level categoricals
azdias_below_thresh = pd.get_dummies(azdias_below_thresh, columns=encoded_columns)
We first checked the ordinal and numerical features and confirmed that all of them are ints (or floats), which is fine for the subsequent analysis and algorithms. Inspecting the categorical features, we found that CAMEO_DEUG_2015 contains numbers stored as strings, so these have to be converted to numeric values before encoding. OST_WEST_KZ was manually encoded since it is a binary feature.
Finally, all categorical features with more than 2 values (excluding NaN) are one-hot encoded.
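To make the effect of this step concrete, here is a minimal sketch, on a hypothetical toy column rather than the project data, of what `pd.get_dummies` produces for a multi-level categorical:

```python
import pandas as pd

# Hypothetical 3-level categorical feature
toy = pd.DataFrame({'COLOR': ['red', 'green', 'red', 'blue']})

# One dummy column per level; columns are ordered alphabetically by level
encoded = pd.get_dummies(toy, columns=['COLOR'])
print(list(encoded.columns))  # ['COLOR_blue', 'COLOR_green', 'COLOR_red']
```

Each row then carries a single true value in the column of its level, which is what `pd.get_dummies` applies to every feature in `encoded_columns` above.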
There are a handful of features that are marked as "mixed" in the feature summary and require special treatment in order to be included in the analysis. Two in particular deserve attention; the handling of the rest is up to our own choices:
mixed_feats = cleaned_feat_info[cleaned_feat_info['type'] == 'mixed']['attribute'].values.tolist()
print(mixed_feats)
['LP_LEBENSPHASE_FEIN', 'PRAEGENDE_JUGENDJAHRE', 'WOHNLAGE', 'CAMEO_INTL_2015', 'PLZ8_BAUMAX']
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
# Define mappings for decade and movement
decade_mapping = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 4, 9: 4, 10: 5, 11: 5, 12: 6, 13: 6, 14: 7, 15: 7}
movement_mapping = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0, 8: 1, 9: 1, 10: 0, 11: 0, 12: 1, 13: 1, 14: 0, 15: 0}
# Apply the mappings directly to create new columns
azdias_below_thresh['PRAEGENDE_JUGENDJAHRE__DECADE'] = azdias_below_thresh['PRAEGENDE_JUGENDJAHRE'].replace(decade_mapping)
azdias_below_thresh['PRAEGENDE_JUGENDJAHRE__MOVEMENT'] = azdias_below_thresh['PRAEGENDE_JUGENDJAHRE'].replace(movement_mapping)
# Drop the original 'PRAEGENDE_JUGENDJAHRE' feature
azdias_below_thresh.drop(columns=['PRAEGENDE_JUGENDJAHRE'], inplace=True)
mixed_feats.remove('PRAEGENDE_JUGENDJAHRE')
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
# Change feature "CAMEO_INTL_2015" to integers first
azdias_below_thresh['CAMEO_INTL_2015'] = pd.to_numeric(azdias_below_thresh['CAMEO_INTL_2015'], errors='coerce')
# Create a new column 'WEALTH' by extracting the 'tens'-place digit
azdias_below_thresh['CAMEO_INTL_2015__WEALTH'] = azdias_below_thresh['CAMEO_INTL_2015'] // 10
# Create a new column 'LIFE_STAGE' by extracting the 'ones'-place digit
azdias_below_thresh['CAMEO_INTL_2015__LIFE_STAGE'] = azdias_below_thresh['CAMEO_INTL_2015'] % 10
# Drop the original 'CAMEO_INTL_2015' feature
azdias_below_thresh.drop(columns=['CAMEO_INTL_2015'], inplace=True)
mixed_feats.remove('CAMEO_INTL_2015')
We handled the two mixed features that required attention. In the next cell we will also engineer two more features from the WOHNLAGE feature, which describes neighborhood quality (or a rural flag) and is useful for our algorithm. We will also one-hot encode PLZ8_BAUMAX, as it could help when measuring building type. The remaining feature, LP_LEBENSPHASE_FEIN, will be dropped since it is just the life stage at a fine scale, which we captured earlier.
In order to finish this step, we need to make sure that our data frame now only has the columns we want to keep. To summarize, the dataframe should consist of the following:
Make sure that for any new columns we have engineered, the original columns are excluded from the final dataset. Otherwise, their values will interfere with the analysis later in the project. For example, we should not keep "PRAEGENDE_JUGENDJAHRE", since its raw values won't be useful for the algorithm: only the engineered features derived from it should be retained. As a reminder, our data should only be from the subset of rows with few or no missing values.
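A quick sanity check can confirm that no original source column survived. The sketch below uses hypothetical column sets rather than the actual DataFrame:

```python
# Hypothetical engineered column set, mirroring the steps in this section
cleaned_columns = {'PRAEGENDE_JUGENDJAHRE__DECADE', 'PRAEGENDE_JUGENDJAHRE__MOVEMENT',
                   'CAMEO_INTL_2015__WEALTH', 'CAMEO_INTL_2015__LIFE_STAGE', 'RURAL'}

# Source columns that must have been dropped after engineering
replaced = {'PRAEGENDE_JUGENDJAHRE', 'CAMEO_INTL_2015', 'WOHNLAGE'}

# An empty intersection means no original column survived
leftovers = cleaned_columns & replaced
print(sorted(leftovers))  # []
```

The same set intersection can be run against the real DataFrame's columns after cleaning.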
# Extra feature engineering task
# OHE PLZ8_BAUMAX
azdias_below_thresh = pd.get_dummies(azdias_below_thresh, columns=['PLZ8_BAUMAX'])
mixed_feats.remove('PLZ8_BAUMAX')
# Create binary features for quality categories
quality_categories = [1, 2, 3, 4, 5]
for category in quality_categories:
    azdias_below_thresh[f'QUALITY_{category}'] = (azdias_below_thresh['WOHNLAGE'] == category).astype(int)
# Create a binary feature for rural vs. urban buildings
azdias_below_thresh['RURAL'] = ((azdias_below_thresh['WOHNLAGE'] == 7) | (azdias_below_thresh['WOHNLAGE'] == 8)).astype(int)
# Drop the original 'WOHNLAGE' feature
azdias_below_thresh.drop(columns=['WOHNLAGE'], inplace=True)
mixed_feats.remove('WOHNLAGE')
# Drop remaining mixed features
azdias_below_thresh.drop(columns=mixed_feats, inplace=True)
azdias_below_thresh.reset_index(drop=True, inplace=True)
Even though we've finished cleaning up the general population demographics data, it's important to look ahead and realize that we'll need to perform the same cleaning steps on the customer demographics data. In this substep, we complete the function below to execute the main feature selection, encoding, and re-engineering steps performed above. Then, when it comes to the customer data in Step 3, we can just run this function on that DataFrame to get the trimmed dataset in a single step.
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data.

    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    df_new = df.copy()
    # Replace missing or unknown values with NaN
    for attribute, missing_values in feat_info_dict.items():
        if missing_values is not None:
            df_new[attribute].replace(missing_values, np.nan, inplace=True)
    # Remove the outlier columns from the dataset
    outliers = ['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH']
    df_new.drop(outliers, axis=1, inplace=True)
    # Remove rows with more NaNs than the threshold; the row NaN counts are
    # computed on df_new itself so the boolean mask is aligned with this DataFrame
    threshold = 25
    row_nan_count = df_new.isnull().sum(axis=1)
    df_new = df_new[row_nan_count <= threshold]
    df_new.reset_index(drop=True, inplace=True)
    # Drop the highly correlated columns from the DataFrame
    df_new.drop(columns=columns_to_drop, inplace=True)
    # Convert CAMEO_DEUG_2015 from number strings to numeric values
    df_new['CAMEO_DEUG_2015'] = pd.to_numeric(df_new['CAMEO_DEUG_2015'], errors='coerce')
    # Manually encode the binary feature OST_WEST_KZ (NaN stays NaN)
    df_new['OST_WEST_KZ'] = df_new['OST_WEST_KZ'].map({'W': 1, 'O': 0})
    # One-hot encode the multi-level categoricals
    df_new = pd.get_dummies(df_new, columns=encoded_columns)
    # Define mappings for decade and movement
    decade_mapping = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 4, 9: 4, 10: 5, 11: 5, 12: 6, 13: 6, 14: 7, 15: 7}
    movement_mapping = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0, 8: 1, 9: 1, 10: 0, 11: 0, 12: 1, 13: 1, 14: 0, 15: 0}
    # Apply the mappings directly to create new columns
    df_new['PRAEGENDE_JUGENDJAHRE__DECADE'] = df_new['PRAEGENDE_JUGENDJAHRE'].replace(decade_mapping)
    df_new['PRAEGENDE_JUGENDJAHRE__MOVEMENT'] = df_new['PRAEGENDE_JUGENDJAHRE'].replace(movement_mapping)
    # Drop the original 'PRAEGENDE_JUGENDJAHRE' feature
    df_new.drop(columns=['PRAEGENDE_JUGENDJAHRE'], inplace=True)
    # Convert "CAMEO_INTL_2015" to numeric values first
    df_new['CAMEO_INTL_2015'] = pd.to_numeric(df_new['CAMEO_INTL_2015'], errors='coerce')
    # Create 'WEALTH' from the tens digit and 'LIFE_STAGE' from the ones digit
    df_new['CAMEO_INTL_2015__WEALTH'] = df_new['CAMEO_INTL_2015'] // 10
    df_new['CAMEO_INTL_2015__LIFE_STAGE'] = df_new['CAMEO_INTL_2015'] % 10
    # Drop the original 'CAMEO_INTL_2015' feature
    df_new.drop(columns=['CAMEO_INTL_2015'], inplace=True)
    # One-hot encode PLZ8_BAUMAX
    df_new = pd.get_dummies(df_new, columns=['PLZ8_BAUMAX'])
    # Create binary features for quality categories
    quality_categories = [1, 2, 3, 4, 5]
    for category in quality_categories:
        df_new[f'QUALITY_{category}'] = (df_new['WOHNLAGE'] == category).astype(int)
    # Create a binary feature for rural vs. urban buildings
    df_new['RURAL'] = ((df_new['WOHNLAGE'] == 7) | (df_new['WOHNLAGE'] == 8)).astype(int)
    # Drop the original 'WOHNLAGE' feature
    df_new.drop(columns=['WOHNLAGE'], inplace=True)
    # Drop remaining mixed features
    df_new.drop(columns=mixed_feats, inplace=True)
    df_new.reset_index(drop=True, inplace=True)
    # Return the cleaned DataFrame
    return df_new
Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features.
We will use the .fit_transform() method to both fit a procedure to the data and apply the transformation to the data at the same time. Don't forget to keep the fitted sklearn objects handy, since we'll be applying them to the customer demographics data towards the end of the project.
# Create an instance of SimpleImputer with the mean strategy
imputer = SimpleImputer(strategy='mean')
azdias_imputed = imputer.fit_transform(azdias_below_thresh)
azdias_imputed = pd.DataFrame(azdias_imputed, columns=azdias_below_thresh.columns)
# Calculate NaN counts before and after imputation
nan_count_before = azdias_below_thresh.isna().sum().sum()
nan_count_after = azdias_imputed.isna().sum().sum()
print(f'NaN Count before imputation = {nan_count_before}\n')
print(f'NaN Count after imputation = {nan_count_after}')
NaN Count before imputation = 761265
NaN Count after imputation = 0
# Apply feature scaling to the general population demographics data.
scaler = StandardScaler()
final_azdias = pd.DataFrame(scaler.fit_transform(azdias_imputed), columns=azdias_imputed.columns)
We kept the rows with missing data but performed mean imputation on them; after that, a standard scaler was applied to the whole dataset so that PCA works properly.
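Conceptually, the two fitted objects store the column statistics that will later be re-applied to the customer data. Here is a numpy sketch of what they compute, using toy numbers rather than the real dataset:

```python
import numpy as np

# Toy matrix with one missing entry
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 8.0]])

# Mean imputation: fill each NaN with its column mean over observed values
col_means = np.nanmean(X, axis=0)            # what SimpleImputer stores at fit time
X_imp = np.where(np.isnan(X), col_means, X)

# Standard scaling: zero mean, unit variance per column
mu, sigma = X_imp.mean(axis=0), X_imp.std(axis=0)  # what StandardScaler stores
X_scaled = (X_imp - mu) / sigma
```

Keeping `col_means`, `mu`, and `sigma` fixed (i.e. calling only `.transform()` later) is what lets us process the customer data consistently with the general population.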
On our scaled data, we are now ready to apply dimensionality reduction techniques.
We will visualize the ratio of variance explained by each principal component, as well as the cumulative variance, with a scree plot. Based on what we find, we select a value for the number of transformed features we'll retain for the clustering part of the project.
def scree_plot(pca):
    explained_variance = pca.explained_variance_ratio_
    cumulative_variance = explained_variance.cumsum()
    plt.figure(figsize=(10, 6))
    plt.bar(range(1, len(explained_variance) + 1), explained_variance, label='Individual Component Variance', alpha=0.7, color='skyblue')
    plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, label='Cumulative Variance', marker='o', markersize=2, linestyle='-', color='orange')
    plt.xlabel('Principal Component Number')
    plt.ylabel('Variance Explained (ratio)')
    plt.title('Scree Plot: Explained Variance by Principal Components')
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.show()
# Apply PCA to the data.
pca = PCA()
pca.fit(final_azdias)
# Investigate the variance accounted for by each principal component.
scree_plot(pca)
In this scree plot, we observe that the cumulative explained variance reaches 100% (total variance) at around 175 principal components. However, to reach approximately 95% explained variance, we only need somewhere between 110 and 130 components.
Selecting this reduced set of principal components, while still explaining 95% of the data's variance, offers a significant reduction in dimensionality. This reduction streamlines subsequent analyses and modeling while retaining the essential information present in the data.
So, we will apply PCA again with n_components=0.95 and compare the resulting number of components to our visualisation.
print(f'Number of features before PCA: {final_azdias.shape[1]}')
# Apply PCA to achieve 95% variance
pca = PCA(n_components=0.95)
pca.fit(final_azdias)
# Print the number of components for PCA while achieving 95% Variance
print(f'Number of components for PCA while achieving 95% Variance: {pca.n_components_}')
Number of features before PCA: 182
Number of components for PCA while achieving 95% Variance: 121
# Transform DataFrame using PCA
final_azdias_with_pca = pca.transform(final_azdias)
# Visualise the variance accounted for by each principal component with 95% variance.
scree_plot(pca)
We first fitted PCA without restricting the number of components and found that, to keep 95% of the variance, we need somewhere between 110 and 130 components. After running PCA again with n_components=0.95, the resulting number of principal components is 121.
We managed to reduce the number of features from 182 to 121.
Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.
As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one variable should be associated with decreases in the other.
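This sign interpretation can be illustrated on toy data, separate from the project pipeline: for two positively correlated features, the first principal component — the leading eigenvector of the covariance matrix — assigns both features weights of the same sign (up to a global sign flip). A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two positively correlated toy features: x2 tracks x1 plus small noise
x1 = rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)

# First principal component = eigenvector of the covariance matrix
# with the largest eigenvalue (np.linalg.eigh sorts eigenvalues ascending)
cov = np.cov(X, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]

# Same-sign weights, regardless of the arbitrary overall sign of pc1
print(pc1[0] * pc1[1] > 0)  # True
```

The product of the two weights is positive whichever global sign the solver returns, which is why we read correlated feature groups off the weight plots below by sign rather than by absolute orientation.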
def plot_component(component_weight, feature_names, title):
    """
    Plot the top features for a single principal component with Seaborn styling.

    Args:
        component_weight (array-like): Weights for the principal component.
        feature_names (list): List of feature names.
        title (string): Title of the plot.

    Returns:
        None
    """
    # Sort features by their absolute weights in descending order
    sorted_features = sorted(zip(feature_names, component_weight), key=lambda x: abs(x[1]), reverse=True)
    top_features = sorted_features[:10]
    # Extract feature names and corresponding weights
    feature_names, weights = zip(*top_features)
    # Create a bar plot: positive weights in blue, negative in red
    plt.figure(figsize=(10, 6))
    color = ['skyblue' if w >= 0 else 'lightcoral' for w in weights]
    plt.barh(np.arange(len(feature_names)), weights, color=color)
    plt.yticks(np.arange(len(feature_names)), feature_names)
    plt.xlabel('Weight')
    plt.title(title)
    plt.gca().invert_yaxis()
    plt.grid(axis='x', linestyle='--', alpha=0.6)
    plt.rcParams['axes.edgecolor'] = '#333F4B'
    plt.rcParams['axes.linewidth'] = 0.8
    plt.rcParams['xtick.color'] = '#333F4B'
    plt.rcParams['ytick.color'] = '#333F4B'
    plt.tight_layout()
    plt.show()
    # Create a DataFrame to display the top features and their weights
    top_features_df = pd.DataFrame({'Feature': feature_names, 'Weight': weights})
    display(top_features_df)
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
feature_names = final_azdias.columns.tolist()
component_weights = pca.components_
plot_component(component_weights[0], feature_names, 'First Principal Component Weights')
| | Feature | Weight |
|---|---|---|
| 0 | MOBI_REGIO | -0.205457 |
| 1 | PLZ8_ANTG3 | 0.194958 |
| 2 | PLZ8_ANTG1 | -0.194710 |
| 3 | KBA05_ANTG1 | -0.191092 |
| 4 | PLZ8_ANTG4 | 0.188818 |
| 5 | CAMEO_INTL_2015__WEALTH | 0.187478 |
| 6 | PLZ8_BAUMAX_1.0 | -0.183475 |
| 7 | FINANZ_MINIMALIST | -0.182912 |
| 8 | KBA05_GBZ | -0.182760 |
| 9 | HH_EINKOMMEN_SCORE | 0.176077 |
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
plot_component(component_weights[1], feature_names, 'Second Principal Component Weights')
| | Feature | Weight |
|---|---|---|
| 0 | PRAEGENDE_JUGENDJAHRE__DECADE | -0.239654 |
| 1 | ALTERSKATEGORIE_GROB | 0.231984 |
| 2 | FINANZ_SPARER | -0.228029 |
| 3 | FINANZ_VORSORGER | 0.219796 |
| 4 | FINANZ_UNAUFFAELLIGER | -0.217578 |
| 5 | SEMIO_REL | -0.213131 |
| 6 | SEMIO_TRADV | -0.207335 |
| 7 | SEMIO_PFLICHT | -0.204418 |
| 8 | FINANZ_ANLEGER | -0.202930 |
| 9 | ZABEOTYP_3 | 0.197914 |
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.
plot_component(component_weights[2], feature_names, 'Third Principal Component Weights')
| | Feature | Weight |
|---|---|---|
| 0 | ANREDE_KZ | -0.352909 |
| 1 | SEMIO_VERT | 0.327128 |
| 2 | SEMIO_KAEM | -0.325363 |
| 3 | SEMIO_DOM | -0.293393 |
| 4 | SEMIO_KRIT | -0.272109 |
| 5 | SEMIO_FAM | 0.262422 |
| 6 | SEMIO_SOZ | 0.258469 |
| 7 | SEMIO_KULT | 0.254445 |
| 8 | SEMIO_ERL | -0.208117 |
| 9 | SEMIO_RAT | -0.162015 |
For the first principal component we have the following:
For the second principal component we have the following:
For the third principal component we have the following:
We've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, we will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.
centers = range(10, 31)
inertia_scores = []
for center in centers:
    # Run k-means clustering on the data
    kmeans = KMeans(n_clusters=center, n_init='auto')
    kmeans.fit(final_azdias_with_pca)
    # Record the inertia value
    inertia_scores.append(kmeans.inertia_)
# Investigate the change in within-cluster distance across number of clusters.
plt.plot(centers, inertia_scores, linestyle='-', marker='o', color='b')
plt.title("Elbow Method for Optimal Number of Clusters - [Inertia vs k]", fontsize=16)
plt.xlabel("Number of Clusters (k)", fontsize=12)
plt.ylabel("Inertia", fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans_optimal = KMeans(n_clusters=25, n_init='auto')
kmeans_optimal.fit(final_azdias_with_pca)
azdias_preds = kmeans_optimal.predict(final_azdias_with_pca)
We applied k-means with the number of clusters ranging from 10 to 30 to pick the best value using the elbow method, which relies on the inertia score of the fitted data.
We then fit a final k-means model with the chosen number of clusters to use for the rest of the project.
Although the graph is not a perfect elbow, we can see that the inertia decreases more slowly near k=25, which was our choice.
Now that we have clusters and cluster centers for the general population, it's time to see how the customer data maps onto those clusters. Take care not to confuse this with re-fitting all of the models to the customer data. Instead, we use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, we will interpret how the general population fits apply to the customer data.
The customer demographics file is semicolon (;) delimited. We apply the clean_data() function we created earlier, and we must not use the .fit() or .fit_transform() methods to re-fit the old objects, nor create new sklearn objects! We carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
# Load in the customer demographics data.
customers = pd.read_csv('dataset/Udacity_CUSTOMERS_Subset.csv', sep=';')
# Apply the same feature wrangling, selection, and engineering steps to the customer demographics.
final_customers = clean_data(customers)
# Get the column names of each DataFrame
columns_azdias = set(azdias_below_thresh.columns)
columns_customers = set(final_customers.columns)
# Find the columns that exist in azdias but not in customers
columns_only_in_azdias = columns_azdias - columns_customers
# Find the columns that exist in customers but not in azdias
columns_only_in_customers = columns_customers - columns_azdias
if not columns_only_in_azdias and not columns_only_in_customers:
    print("Both DataFrames have the same columns.")
else:
    if columns_only_in_azdias:
        print(f"Columns only in azdias and not in customers: {columns_only_in_azdias}")
    if columns_only_in_customers:
        print(f"Columns only in customers and not in azdias: {columns_only_in_customers}")
Columns only in azdias and not in customers: {'GEBAEUDETYP_5.0'}
# Add missing column
final_customers['GEBAEUDETYP_5.0'] = 0
# Reorder the columns after adding the new column
final_customers = final_customers[azdias_below_thresh.columns]
# Apply imputer to fill missing values in 'final_customers'
customers_imputed = pd.DataFrame(imputer.transform(final_customers), columns=final_customers.columns)
# Scale the imputed data
customers_scaled = pd.DataFrame(scaler.transform(customers_imputed), columns=customers_imputed.columns)
# Apply PCA transformation
customers_with_pca = pca.transform(customers_scaled)
# Make predictions
customers_preds = kmeans_optimal.predict(customers_with_pca)
At this point, we have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, we will compare the two cluster distributions to see where the strongest customer base for the company is.
Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If the company's customer base were universal, then the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), then that suggests the people in that cluster are a target audience for the company. On the other hand, if the proportion of data in a cluster is larger in the general population than in the customer data (e.g. only 2% of customers are closest to a population centroid that captures 6% of the data), that suggests that group of persons is outside of the target demographics.
Take a look at the following points in this step:
Seaborn's countplot() or barplot() functions could be handy here. To interpret a cluster, we can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
# Determine the number of clusters
num_clusters = max(max(azdias_preds), max(customers_preds)) + 1
# Count the occurrences of each cluster in both datasets
azdias_counts = np.bincount(azdias_preds, minlength=num_clusters)
customers_counts = np.bincount(customers_preds, minlength=num_clusters)
# Create an array of cluster indices
cluster_indices = np.arange(num_clusters)
bar_width = 0.35
fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(cluster_indices - bar_width/2, azdias_counts, bar_width, label='Demographics Predictions', alpha=0.7, color='b')
bars2 = ax.bar(cluster_indices + bar_width/2, customers_counts, bar_width, label='Customer Predictions', alpha=0.7, color='g')
ax.set_xlabel('Cluster Number')
ax.set_ylabel('Count')
ax.set_title('Comparison of K-Means Clusters between Customer Data and Demographics Data')
ax.set_xticks(cluster_indices)
ax.legend()
plt.tight_layout()
plt.show()
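Because the general population has far more rows than the customer data, comparing raw counts can be misleading; normalizing each distribution to proportions makes over- and under-representation explicit. A small sketch with hypothetical cluster assignments:

```python
import numpy as np

# Hypothetical cluster labels for two datasets of very different sizes
population_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
customer_labels = np.array([0, 2, 2, 2, 2])

n_clusters = 3
pop_prop = np.bincount(population_labels, minlength=n_clusters) / len(population_labels)
cust_prop = np.bincount(customer_labels, minlength=n_clusters) / len(customer_labels)

# A positive difference means the cluster is over-represented among customers
diff = cust_prop - pop_prop
print(diff)  # here cluster 2 is over-represented, cluster 1 under-represented
```

The same normalization can be applied to azdias_counts and customers_counts before plotting, so the two bar heights per cluster are directly comparable.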
# List of important features to be inspected
important_features = ['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'HH_EINKOMMEN_SCORE', 'RURAL', 'QUALITY_1',
'QUALITY_2', 'QUALITY_3', 'QUALITY_4', 'QUALITY_5', 'CAMEO_INTL_2015__WEALTH',
'CAMEO_INTL_2015__LIFE_STAGE', 'PLZ8_BAUMAX_1.0', 'PLZ8_BAUMAX_2.0', 'PLZ8_BAUMAX_3.0',
'PLZ8_BAUMAX_4.0', 'PLZ8_BAUMAX_5.0', 'REGIOTYP']
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
cluster_10 = kmeans_optimal.cluster_centers_[10]
over_pca_reversed = pca.inverse_transform(cluster_10).reshape(1, -1)
over_scaled_reversed = scaler.inverse_transform(over_pca_reversed).reshape(-1)
over_result = pd.DataFrame(over_scaled_reversed, columns=['Features'], index=final_customers.columns)
print(over_result.loc[important_features, 'Features'])
ALTERSKATEGORIE_GROB           1.869091
ANREDE_KZ                      1.965022
HH_EINKOMMEN_SCORE             4.118406
RURAL                          0.474269
QUALITY_1                      0.047516
QUALITY_2                      0.112978
QUALITY_3                      0.235639
QUALITY_4                      0.082051
QUALITY_5                      0.041367
CAMEO_INTL_2015__WEALTH        2.366753
CAMEO_INTL_2015__LIFE_STAGE    3.359207
PLZ8_BAUMAX_1.0                0.953657
PLZ8_BAUMAX_2.0                0.018305
PLZ8_BAUMAX_3.0                0.002127
PLZ8_BAUMAX_4.0               -0.000844
PLZ8_BAUMAX_5.0                0.001536
REGIOTYP                       4.259116
Name: Features, dtype: float64
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
cluster_12 = kmeans_optimal.cluster_centers_[12]
under_pca_reversed = pca.inverse_transform(cluster_12).reshape(1, -1)
under_scaled_reversed = scaler.inverse_transform(under_pca_reversed).reshape(-1)
under_result = pd.DataFrame(under_scaled_reversed, columns=['Features'], index=final_customers.columns)
print(under_result.loc[important_features, 'Features'])
ALTERSKATEGORIE_GROB           1.831009
ANREDE_KZ                      0.953732
HH_EINKOMMEN_SCORE             5.581475
RURAL                          0.026054
QUALITY_1                      0.020037
QUALITY_2                      0.088182
QUALITY_3                      0.369533
QUALITY_4                      0.303654
QUALITY_5                      0.187990
CAMEO_INTL_2015__WEALTH        4.530094
CAMEO_INTL_2015__LIFE_STAGE    1.783870
PLZ8_BAUMAX_1.0                0.201109
PLZ8_BAUMAX_2.0                0.141738
PLZ8_BAUMAX_3.0                0.144613
PLZ8_BAUMAX_4.0                0.177179
PLZ8_BAUMAX_5.0                0.311444
REGIOTYP                       4.974027
Name: Features, dtype: float64
From the figure, cluster 10 is relatively overrepresented in the customer data compared to the demographics data, while cluster 12 is the opposite.
Here are some remarks and findings regarding our clustered data.
From cluster 10, we can infer that customers with the following features are the main target audience for the mail-order sales company:
From cluster 12, we can infer that people with the following features are not necessarily a target audience for the mail-order sales company: